\documentclass[10pt]{article}

\usepackage{graphicx}      % Enable graphics commands
\usepackage{lscape}		% Enable landscape with \begin{landscape} until \end{landscape}
\usepackage[section]{placeins} % Keep tables and figures within their own sections
\usepackage{natbib}			% Enable citation commands \citep{}, \citet{}, etc.
\bibpunct{(}{)}{;}{a}{}{,}		% Formatting for in-text citations
\usepackage{setspace}		% Enable double-spacing with \begin{spacing}{2} until \end{spacing}.
\usepackage[utf8]{inputenc} 	% Enable utf8 characters, i.e., accents without coding--just type them in.
\usepackage[english]{babel}	% English hyphenation and alphabetization.  Other languages available.
\usepackage{dcolumn}        % For decimal-aligned stargazer output.
\usepackage[colorlinks=true, urlcolor=blue, citecolor=black, linkcolor=black]{hyperref} % Include hyperlinks with the \url and \href commands.
\setlength{\tabcolsep}{1pt}	% Make tables slightly narrower by reducing space between columns.

\renewcommand\floatpagefraction{.9}	% These commands allow larger tables and graphics to fit
\renewcommand\topfraction{.9}		% on a page when default settings would complain.
\renewcommand\bottomfraction{.9}
\renewcommand\textfraction{.1}
\setcounter{totalnumber}{50}
\setcounter{topnumber}{50}
\setcounter{bottomnumber}{50}

\newcommand{\R}{\textsf{R}~}        %This creates the command \R to typeset the name R correctly.

%\usepackage[left=1in, right=1in]{geometry}	%Turn footnotes into endnotes (commented out).
%\renewcommand{\footnotesize}{\normalsize}	
%\usepackage{endnotes}
%\renewcommand{\footnote}{\endnote}
%\renewcommand{\section}{\subsection}
\usepackage{fullpage}

\begin{document}


\title{The Standardized World Income Inequality Database}		
\author{
    Frederick Solt\\
    \href{mailto:frederick-solt@uiowa.edu}{frederick-solt@uiowa.edu}
}
\date{}				
\maketitle

\begin{spacing}{2}

\begin{abstract}
\emph{Objective.}  Since 2008, the Standardized World Income Inequality Database (SWIID) has provided income inequality data that seek to maximize comparability while providing the broadest possible coverage of countries and years.  This article describes the current SWIID's construction, highlighting differences from its original version, and re-evaluates the SWIID's utility to cross-national income inequality research in light of recently available alternatives.  \emph{Methods.}  Coverage of inequality datasets is assessed across country-years; comparability is evaluated in terms of success in predicting the Luxembourg Income Study (LIS), recognized in the field as the gold standard in comparability, before those data are released. \emph{Results.}  The SWIID offers coverage double that of the next largest income inequality dataset, and its record of comparability is three to eight times better than those of alternate datasets.  \emph{Conclusions.}  As its coverage and comparability far exceed those of the alternatives, the SWIID remains better suited for broadly cross-national research on income inequality than other available sources.  
\end{abstract}


\newpage					


<<data_setup, echo=FALSE, results='hide', warning=FALSE, message=FALSE>>==
library(foreign)
library(plyr)
library(countrycode)
library(reshape2)
# library(XML)
library(ggplot2)

#datapath <- "~/Documents/Projects/Data/"
@

Interest in income inequality, its causes, and its consequences has increased dramatically in recent years among both scholars and the public.  To make valid comparisons of levels and trends in income inequality across countries and over time, however, one must have comparable data.  Although there is a large quantity of data on inequality available for cross-national and over-time analyses, unfortunately most of these data are simply not comparable due to differences in the population covered, in terms of geography, age, and employment status; the welfare definition employed, such as market income or consumption; the equivalence scale applied, such as household per capita or household adult equivalent; and the treatment of various other items, such as non-monetary income and imputed rents.  The Standardized World Income Inequality Database (SWIID) was introduced in 2008 to provide researchers with income inequality data that maximize comparability for the broadest possible sample of countries and years \citep[see][]{Solt2009}.  

While it retains that same goal, the SWIID has evolved and expanded considerably since that time.  After briefly reviewing the problem of comparability in cross-national income inequality data, this article explains how the current version of the SWIID addresses the issue, noting the ways in which its construction has changed from the original version.  To evaluate the SWIID's utility to researchers, the article then offers an assessment of the SWIID's performance in comparison to alternatives that have become available since the SWIID was first published.   It concludes with an explanation of how to use the SWIID data in cross-national analyses.


\section*{The Problem of Comparability}


<<qq, echo=FALSE, results='hide', cache=TRUE, warning=FALSE>>==
qq <- structure(list(Dataset = c(" LIS", "", "", "WIID2c", "WIID3b", 
    "", "D&S Accept", "OECD", "SWIID Source", ""), Observations = c(232L, 
    2043L, 315L, 1705L, 2041L, 430L, 682L, 400L, 2812L, 183L), Definitions = c(1L, 
    8L, 1L, 11L, 11L, 1L, 10L, 1L, 13L, 1L), Harmonized = c(1L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), Dataset1 = c("", "All the Ginis", 
    "SEDLAC", "", "", "Eurostat", "", "", "", "CEPALStat")), .Names = c("Dataset", 
    "Observations", "Definitions", "Harmonized", "Dataset1"), class = "data.frame", row.names = c(NA, 
    -10L))

pdf(file="qq.pdf",width=8, height=5.25)
ggplot(qq, aes(x=Definitions-1, y=Observations, color=2-qq$Harmonized)) + geom_point() + 
    geom_text(size=5, aes(x=Definitions-1, y=Observations, label=Dataset, color=2-qq$Harmonized), hjust=-.1) +
	geom_text(size=5, aes(x=Definitions-1, y=Observations, label=Dataset1, color=2-qq$Harmonized), hjust=1.1) +
	scale_colour_identity() + 
	labs(x = "Additional Combinations of Welfare Definition and Equivalence Scale", y = "Country-Years Observed") +
	coord_cartesian(ylim = c(0, 3000), xlim = c(-3,16)) +
	theme_bw() +
	# Axis-title sizes are set after theme_bw(): a complete theme added last would reset them
	theme(axis.title.x = element_text(size = rel(1.5)),
	      axis.title.y = element_text(size = rel(1.5)))
graphics.off()
@

[Figure \ref{F:qq} about here.]

\nocite{Eurostat2014, OECD2014, CEPALStat2014}

The tradeoff between coverage and comparability is readily evident in Figure~\ref{F:qq}, which graphs the number of country-year observations of the Gini index provided by different cross-national datasets against the number of different combinations of welfare definition and equivalence scale used in calculating these statistics.  The Luxembourg Income Study is the only source that provides inequality statistics calculated using a uniform set of assumptions and definitions on the basis of microdata that has been painstakingly harmonized to maximize its comparability.  Its reputation as the gold standard of cross-nationally comparable inequality data is well deserved, but at present it provides only 232 country-years of data in 41 countries.  More observations are available from other sources at the cost of sacrificing the benefits to comparability of harmonization.  Eurostat, for example, provides 430 country-years (in 32 European countries) of inequality in household disposable income per adult equivalent. Even more observations are available from sources composed of observations calculated using different welfare definitions or equivalence scales.  The source data used to generate the SWIID, described in more detail below, draws on all of the sources depicted here as well as national statistical offices and the scholarly literature.  It now comprises more than 10,000 Gini indices in over 2,800 country-years in 174 countries, but these Ginis are calculated on the basis of eleven different combinations of welfare definition and equivalence scale.

This tradeoff suggests researchers have two principal options, neither very attractive.  The first is to maximize comparability.  This can be done either by insisting on only the harmonized data of the LIS or data from some other single source or, at some sacrifice, by using data generated using only a single basis of calculation.  Either way, privileging comparability entails giving up on making many comparisons and throwing away most of the available information.  This is true even for fairly recent years in the most data-rich part of the world, Europe: depending on only Eurostat data to provide information about the context of inequality in which the first four waves of the European Social Survey were conducted (2002--2009) would result in missing data for more than one-fourth of the country-years in the sample \citep[see][]{Solt2015}.  Even the least stringent approach, combining data sources that use a single combination of welfare definition and equivalence scale, yields just 1128 country-year observations of disposable-income per-adult-equivalent Ginis in the SWIID source data, only about 40\% of the total country-years available.\footnote{\doublespacing A third option, abandoning the Gini index and similar summaries of the entire income distribution and adopting instead a different conceptualization of income inequality that may allow more data, and more comparable data, to be brought to bear, has been taken by two prominent efforts, the University of Texas Inequality Project and the World Top Incomes Database.  These efforts will be discussed further below.}

The second is to increase coverage by using more of the available data and making global fixed adjustments to account for the average differences between statistics based on different calculations. (An often employed but even less defensible variant, of course, is to simply ignore the incomparability altogether; for a recent example, see \citet{Halter2014}.)  \citet[8]{Milanovic2013} stressed the incomparability of the observations included in his \emph{All the Ginis} dataset and included a series of dummy variables indicating whether a particular observation was based on a welfare definition of gross income, net income, or expenditure and whether it was calculated using the unadjusted household or household per capita income as the equivalence scale.  In line with the recommendations of \citet[582]{Deininger1996}, he advised using these dummies to make adjustments for each of these characteristics.

Although the recommendation to calculate such global fixed adjustments is straightforward to implement, it does not satisfactorily deal with the incomparability of these inequality statistics.  The difference between inequality across household incomes and across household incomes per capita will depend on the relationship between income and household size; to the extent this relationship varies across countries and over time, a fixed adjustment will underestimate inequality in some country-years and overestimate it in others.  Similarly, the difference between inequality in net income and in expenditures depends on patterns of savings and consumption across households; these patterns are known to be different across countries \citep[see, e.g.,][]{Kirsanova2007}.  The difference between gross- and net-income inequality reflects the progressivity of the tax code and any variation in compliance by income, both of which are also well understood to vary considerably across countries and years \citep[see, e.g.,][]{Forster2014}.    

An example can best illustrate the severe limitations of combining observations calculated on different bases using only global fixed adjustments.  Consider the question of whether income inequality is higher in China or India posed by \citet[102]{Mukhopadhaya2011}.  The left panel of Figure~\ref{F:adj} shows unadjusted data for the two countries since 1985 from the \texttt{Giniall} series of Milanovic's \citeyear{Milanovic2013} \emph{All the Ginis} dataset.  It suggests that inequality in both countries was similarly moderate until the mid-1990s, when levels in China increased dramatically and became much higher than those in India. 

<<atg, echo=FALSE, results='hide', cache=TRUE, warning=FALSE, message=FALSE>>==
atg <- read.csv("allginis-CHN+IND.csv")
lis_ci <- data.frame(country=c("China", "India"), year=c(2002, 2004), g=c(50.5,49.1), se=c(.425,.238), g2=c(40.3,27.3))

atg2 <- merge(atg, lis_ci, by=c("country", "year"), all.x=T)

adj1 <- ggplot(data=atg2, aes(x=year, y=Giniall, colour=country)) + geom_line() +    
    coord_cartesian(xlim=c(1985,2012),ylim = c(15, 55)) +
	labs(x = "Year", y = "Gini Index") +
	geom_text(aes(2000,31,label = "India", colour="India"), size=4.5) +
	geom_text(aes(1997,47,label = "China", colour="China"), size=4.5) +
    theme_bw() + theme(legend.position="none")	

adj2 <- ggplot(data=atg2, aes(x=year, y=Giniall2, colour=country)) + geom_line() +   
	coord_cartesian(xlim=c(1985,2012),ylim = c(15, 55)) +
	labs(x = "Year", y = "Gini Index") +
	geom_text(aes(1997,25.75,label = "adj. India", colour="India"), size=4.5) +
	geom_text(aes(1992.5,39.5,label = "adj. China", colour="China"), size=4.5) +
    theme_bw() + theme(legend.position="none")

adj3 <- ggplot(data=atg2, aes(x=year, y=Giniall2, colour=country)) + geom_line() +
	coord_cartesian(xlim=c(1985,2012),ylim = c(15, 55)) +
	labs(x = "Year", y = "Gini Index") +
	geom_text(aes(1997,25.75,label = "adj. India", colour="India"), size=4.5) +
	geom_text(aes(1992.5,39.5,label = "adj. China", colour="China"), size=4.5)	+
	geom_point(aes(x=year, y=g, colour=country)) +    # LIS Gini for each country
	geom_errorbar(aes(ymin=g-1.96*se, ymax=g+1.96*se), width=.1) +    # 95% interval around the LIS Gini
	geom_point(aes(x=year, y=g2, colour=country)) +    # value in column g2 for the same country-year (plotted without an interval)
    geom_text(aes(2004,47.5,label = "LIS India", colour="India"), size=4.5) +
    geom_text(aes(2002,52.5,label = "LIS China", colour="China"), size=4.5) +
    theme_bw() + theme(legend.position="none")    	

multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
    require(grid)
    
    # Make a list from the ... arguments and plotlist
    plots <- c(list(...), plotlist)
    
    numPlots = length(plots)
    
    # If layout is NULL, then use 'cols' to determine layout
    if (is.null(layout)) {
        # Make the panel
        # ncol: Number of columns of plots
        # nrow: Number of rows needed, calculated from # of cols
        layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                         ncol = cols, nrow = ceiling(numPlots/cols))
    }
    
    if (numPlots==1) {
        print(plots[[1]])
        
    } else {
        # Set up the page
        grid.newpage()
        pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
        
        # Make each plot, in the correct location
        for (i in 1:numPlots) {
            # Get the i,j matrix positions of the regions that contain this subplot
            matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
            
            print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                            layout.pos.col = matchidx$col))
        }
    }
}

pdf(file="adj.pdf",width=8, height=5.25)
multiplot(adj1, adj2, adj3, layout = matrix(c(1, 2, 3), nrow=1))
graphics.off()

@

[Figure~\ref{F:adj} about here.]

\texttt{Giniall} is the preferred series in the dataset, but even in this limited subsample it includes observations of inequality in gross income per capita, net income per capita, and expenditures per capita, so ``an adjustment for each of these characteristics is desirable'' \citep[8]{Milanovic2013}.  A linear regression of \texttt{Giniall} on the included dummies for consumption (versus income), household (versus per capita), and gross income (versus net income) across the entire dataset indicates that to make a consistent series of net-income per capita inequality, one should subtract 5.1 points from all consumption-based observations, add 2.4 points to all household-based observations, and subtract 13.3 points from all gross-income-based observations.  The result is presented in the center panel of Figure~\ref{F:adj}.  According to the adjusted data, inequality in China has been consistently (excepting only a single, quickly reversed dip in 2005) and considerably (by an average of about 7 points) higher than in India.
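
The adjustment just described can be written compactly.  In notation introduced here purely for illustration, let $D^{\mathrm{cons}}$, $D^{\mathrm{hh}}$, and $D^{\mathrm{gross}}$ be the dummies for consumption-based, household-based, and gross-income-based observations; the adjusted value is then
\[
G^{\mathrm{adj}} = G - 5.1\,D^{\mathrm{cons}} + 2.4\,D^{\mathrm{hh}} - 13.3\,D^{\mathrm{gross}} .
\]
For a hypothetical gross-income per capita observation of 45.0, the adjusted value would be $45.0 - 13.3 = 31.7$; for a hypothetical expenditure per capita observation of 35.0, it would be $35.0 - 5.1 = 29.9$.  The same offsets are applied regardless of country or year.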

But actually comparable data reveals a very different picture.  The right panel of Figure~\ref{F:adj} adds the single data point available from the LIS for each country: from 2002 in the case of China, 2004 for India.  In the adjusted data, the difference between these two country-years is 13 points.  In the LIS, the difference is just 1.5$\pm$1.0 points; inequality in China is distinguishably but only very slightly higher than in India for these two observations.  The fixed adjustments are better than no adjustments in this case---in the unadjusted data depicted in the left panel this difference is over 21 points---but they still leave a great deal to be desired.

To make a single, global fixed adjustment in an attempt to account for differences in inequality statistics calculated on two dissimilar bases is to assume that these differences are always and everywhere the same, an assumption that is of course false.  As shown in the next section, the SWIID takes advantage of an abundance of available income inequality data to relax this assumption and so maximize comparability.


\section*{Standardizing the Available Data}

The starting point for the SWIID is two collections of Gini indices: the LIS data and the source data.  The LIS data consists of two series, one of net (that is, post-tax, post-transfer) income inequality and one of market (pre-tax, pre-transfer) income inequality.  The net-income inequality series is taken directly from the LIS Key Figures \citep{LIS2014}.  The market income inequality series is generated from the LIS microdata \citep{LIS2014a}.\footnote{\doublespacing The code employed to generate the market income inequality series is available in the online SWIID replication materials; I thank Tomas Hellebrandt of the Peterson Institute for International Economics for his valuable suggestions regarding the market-income inequality series in the LIS.}  The quality and comparability of these LIS data are unparalleled, but, as noted above, their shortcoming for broadly cross-national work is coverage.  From 1967, the year of the first LIS observation, until 2012, the year of the most recent, the LIS includes just 231 observations from 41 countries: there are twelve observations for Canada over these 45 years; seven other countries have only one.\footnote{\doublespacing I make two modifications to the coverage of the two LIS series.  First, New Zealand's privacy laws have thus far prevented the country from contributing microdata to the LIS project; I treat as LIS data four observations of net-income inequality (1982, 1986, 1991, 1996) specially prepared by \citet[73]{StatisticsNZ1999} to be comparable with the LIS Key Figures.  Second, the LIS had originally hosted data on Russia in 1992, 1995, and 2000 from the Russian Academy of Sciences, but it apparently lost permission to use these data sometime in late 2011 or early 2012.  This series was replaced in the LIS with data from the Russian Longitudinal Monitoring Survey for 2000, 2004, 2007, and 2010.  
The RLMS data, however, show a sharp decline in income inequality after 2004 (a fall of over 12\% by 2010) that is not reflected in any other source I have located.  I retain the original 1992 and 1995 RAS-based figures in the LIS series, but because the RLMS-based observations for Russia after 2004 seem to lack face validity, I reluctantly omit these two observations.  I thank Louis Chauvel for our conversation on this second topic.} 

The source data has the reverse set of strengths and weaknesses.  Although in the earliest versions of the SWIID it consisted of only the \citet{UNU2008} database, the source data has since expanded with each revision and now encompasses data provided by all of the major cross-national inequality databases, the national statistical offices of countries around the world, and dozens of scholarly articles.  (The source data, annotated with the original sources, can be found in the SWIID replication materials available online.)  As of version 5.0, the source data includes over ten thousand Gini indices, dating from 1960 to 2013.\footnote{\doublespacing That is, the source data includes about twice the number of Ginis as the entire revision 2c UNU-WIDER database (which includes many observations calculated on less than the entire population or without information regarding the welfare definition or equivalence scale employed), and more than 50\% more than in the whole of the recently released and partially undocumented revision 3b of that source \citep{UNU2014}.  Careful checking confirmed that the new, preliminary additions to these data include few useable Ginis not in the SWIID source data.}  Observations are only included in the source data if they are based on all or nearly all of the country's population and if there is sufficient information to identify the equivalence scale and welfare definition employed in their calculation.  Even with these restrictions, however, the differences in the way these statistics are calculated render them ill-suited for making direct comparisons.

The SWIID uses the two LIS series as baselines to which the source data are standardized.  More precisely, the source data are used to generate model-based multiple imputation estimates of the many missing observations in the LIS series (for a general discussion of model-based multiple imputation, see \citealt{Gelman2007}).  The process begins by sorting the source data into eleven categories defined by the combination of welfare definition and equivalence scale used in their calculation.\footnote{\doublespacing Earlier versions of the SWIID classified the source data using nineteen categories devised by \citet{Babones2007}.  I am grateful to Stephen P. Jenkins for pointing out that these categories were not entirely coherent.  The old categories also included several for observations for which the welfare definition was unknown; as noted above, such observations are now excluded from the source data.}  Observations in the source data are classified as using one of three different welfare definitions: (1)~net income, (2)~market income, or (3)~expenditure.\footnote{\doublespacing Observations calculated on the basis of pre-tax, post-transfer gross income are omitted from the source data when market-income series are available for the same country; otherwise, they are at present classified as market income.  I plan to split these observations out into their own separate classification when sufficient data become available.}  The source data are also classified by equivalence scale: (1)~household per capita, (2)~household adult equivalent, (3)~household unadjusted, or (4)~person.\footnote{\doublespacing Several different definitions of `household adult equivalent' appear in the source data, including the square root of household size (the definition used in the LIS Key Figures), the OECD scale, and several country-specific scales. 
The differences in the Gini indices based on these different definitions of adult equivalent, however, are typically very small, less than one point on the 0–100 scale. For this reason, I opt at present to treat them as a single group to facilitate the standardization process, although at the cost of slightly greater uncertainty.}  As the `person' equivalence scale is used only with information on the distribution of (pre- or post-tax) wage income, there are no observations with the expenditure-person combination of welfare definition and equivalence scale.  The remaining combinations leave the aforementioned eleven categories.  As the standard, the two series of LIS data, which are calculated on the basis of household adult equivalent (using the square root scale) for net and market income respectively, are treated as their own separate categories, bringing the total number of categories to thirteen.  Rather than choose among sources, when more than one observation is available within a category for a particular country and year, these observations are averaged.

The result of the categorization, then, is a dataset of country-year observations, each of which has data on inequality in one or more of the thirteen categories.  What is needed to generate a series with data on all countries and years from the incomplete inequality variables in thirteen categories are the ratios between each pair of variables.  If the ratio $\rho_{ab}$ between the Gini index data in categories $a$ and $b$ were known, missing observations in $a$ could be replaced simply by multiplying available data in $b$ by $\rho_{ab}$.  But as noted previously, the relationship between Gini indices with different reference units and income definitions will vary considerably from country to country and also over time depending on the extent of redistributive policies, details of tax law, patterns of consumption and savings, family structure, and other factors.  In other words, $\rho_{ab}$ is not constant but varies across countries $i$ and years $t$.  Further, $\rho_{abit}$ is only directly calculable for those pairs of categories in those countries and years for which it is not immediately useful, that is, only when data is already available in both categories for that observation.
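
To fix notation for what follows (the symbol $G_{ait}$ is introduced here for illustration and denotes the Gini index in category $a$ for country $i$ in year $t$), the ratio and the imputation it would permit can be written as
\[
\rho_{abit} = \frac{G_{ait}}{G_{bit}}, \qquad \hat{G}_{ait} = \hat{\rho}_{abit} \times G_{bit} .
\]
As a purely hypothetical example, an observed Gini of 40.0 in category $b$ together with a predicted ratio of 0.90 would yield an estimate of $0.90 \times 40.0 = 36.0$ for the missing observation in category $a$.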

Those ratios $\rho_{abit}$ that are directly calculable are valuable nevertheless because they provide information about what the ratios that are missing are likely to be.  Because the factors that affect these ratios---redistributive policies, patterns of consumption, and so on---tend to change only slowly over time within a given country, the best prediction for a missing ratio will be based on available data on the same ratio in the same country in proximate years, thereby minimizing any differences in these factors.  With this in mind, the ratios $\rho_{abit}$ are predicted from the results of a series of models.

First, in those countries with sufficient data, predictions are generated by loess regression, which incorporates the maximum amount of information from proximate years by fitting a smooth curve point-by-point through the available data.  Next, predictions are generated through a series of regression models.  In order of increasing availability---but also increasing uncertainty as reflected in larger standard errors---$\hat{\rho}_{abit}$ is predicted as a function of (1)~country-decade, (2)~country, (3)~region, and (4)~advanced or developing world.\footnote{\doublespacing The earliest versions of the SWIID also predicted $\hat{\rho}_{abit}$ as a function of region-decades, but these models were found not to contribute significantly to the final estimates and were therefore dropped for the sake of simplicity.}  The predictions of all of these models are then combined for each ratio $\rho_{abit}$, assigning each country-year the available prediction with the smallest standard error.

These predictions $\hat{\rho}_{abit}$ alone, however, do not take advantage of all of the information available in the source data.  An additional prediction of each conversion factor can be generated in a two-step process through other categories of data.  That is, the ratio of the LIS net-income data (labeled category 1) to the data in category $b$ can be calculated as the product of the ratio between data in category $a$ and category $b$ and the ratio of the LIS net-income data to data in category $a$: $\hat{\rho}_{1bit} = \hat{\rho}_{abit} \times \hat{\rho}_{1ait}$.
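
The logic of the two-step prediction follows from the definition of the ratios.  Reading $\rho_{xyit}$ as the Gini in category $x$ divided by the Gini in category $y$ (with $G_{xit}$ as illustrative notation for those Ginis), the intermediate category cancels:
\[
\rho_{abit} \times \rho_{1ait} = \frac{G_{ait}}{G_{bit}} \times \frac{G_{1it}}{G_{ait}} = \frac{G_{1it}}{G_{bit}} = \rho_{1bit} .
\]
With hypothetical predicted values $\hat{\rho}_{abit} = 1.10$ and $\hat{\rho}_{1ait} = 0.85$, for instance, the two-step prediction would be $\hat{\rho}_{1bit} = 1.10 \times 0.85 = 0.935$.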

These two-step predictions improve upon the conversion factors predicted in one step in two ways.  First, for some combinations of $a$ and $b$, few or no observations of both categories of the Gini index are available, making modeling $\hat{\rho}_{abit}$ in one step impractical or impossible.  Second, the uncertainty in the predicted conversion factor can often be reduced by averaging the one-step prediction with one or more two-step predictions.  

Once all of the predicted ratios $\hat{\rho}_{1bit}$ are calculated, eleven series of estimates comparable with the LIS net-income series are obtained by multiplying these predicted ratios by the available data in each of the eleven source-data categories.  Because each of these comparable series is incomplete, they are combined into a single variable by assigning each observation the estimate with the least uncertainty or, when the average of some or all of the available estimates yields an even smaller standard error, this average.
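
A simple calculation suggests when averaging is likely to be preferred.  Assume, for illustration only, that two estimates of the same country-year have independent errors with standard errors $s_1 \leq s_2$ and that they are averaged with equal weights (neither assumption is taken from the procedure itself).  The standard error of the average is then
\[
\mathit{se}(\bar{G}) = \frac{1}{2}\sqrt{s_1^{2} + s_2^{2}} ,
\]
which is smaller than $s_1$ only if $s_2 < \sqrt{3}\,s_1$.  With hypothetical values $s_1 = 1.0$ and $s_2 = 1.2$, the average has a standard error of about 0.78 and would be used; with $s_1 = 1.0$ and $s_2 = 2.0$, the average has a standard error of about 1.12, and the single more precise estimate would be retained.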

A final piece of information about the income inequality in a particular country and year is gained by noting that the distribution of income within a country typically changes only slowly over time: contemporary levels of inequality should generally be very similar to levels observed in the preceding year.  With two exceptions discussed below, dramatic differences in the estimates of inequality for a given year and those preceding and following it likely reflect persisting errors in measurement.  Allowing observations to be informed by the estimates for surrounding years works to minimize such errors.  This is achieved by using the following five-year weighted moving average algorithm: $G_{it} = \frac{1}{6} \times ({G_{it-2}+G_{it-1}+(2 \times G_{it})+G_{it+1}+G_{it+2}})$.
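
As a hypothetical illustration of the algorithm, consider a country with estimated Ginis of 30, 31, 36, 31, and 30 in five consecutive years.  The smoothed value for the middle year is
\[
\frac{1}{6} \times \left(30 + 31 + (2 \times 36) + 31 + 30\right) = \frac{194}{6} \approx 32.3 ,
\]
so the isolated spike is pulled about 3.7 points toward the surrounding observations while still counting twice as heavily as any one of them.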

The first exception to the foregoing regards the Luxembourg Income Study data.  Because of the very high quality of the LIS data, differences from one year to the next are unlikely to be caused by persistent measurement error, so observations from this source are not adjusted with the moving average algorithm: all LIS observations are retained without change in the SWIID.\footnote{\doublespacing The LIS data still has its own measurement error, however, averaging 0.38 points in the net-income series and 0.46 points in the market-income series.  The standard errors of the LIS series were calculated by bootstrap using the LIS microdata; the code employed is available in the SWIID replication materials online.}  The second exception involves the countries of eastern Europe and the former Soviet Union during the collapse of communist rule.  The sharp increases in inequality observed in most of these countries from 1990 to 1991 would appear to be due to the profound restructuring of these countries' societies and economies rather than measurement error.  Applying the moving average algorithm to this region results in overestimates of inequality in 1989 and 1990 and underestimates in 1991 and 1992; therefore the algorithm is not used in these countries during these years.  

Simply applying the moving-average algorithm to the net-income inequality variable, however, would  lose the estimates of uncertainty associated with each observation.  Therefore, the variable is re-generated one thousand times through Monte Carlo simulation and the moving-average algorithm applied to each simulation.  
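
One way to express this step, under assumptions stated here for illustration rather than drawn from the replication code, is as follows.  Suppose each simulation $s = 1, \ldots, 1000$ draws a value for every country-year from a normal distribution centered on the point estimate $\hat{G}_{it}$ with standard deviation equal to its standard error $\mathit{se}_{it}$:
\[
G^{(s)}_{it} \sim N\left(\hat{G}_{it},\ \mathit{se}_{it}^{2}\right) .
\]
The moving average is then computed within each simulated series, so that the spread of the smoothed values across the one thousand simulations carries the original uncertainty forward into the smoothed estimates.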

The foregoing steps yield estimates of LIS-compatible Gini indices of the distribution of net income in all country-years for which there is a Gini index in at least one of the eleven categories of source data.  To generate estimates for additional observations, information on inequality across countries and over time from two other sources, the University of Texas Inequality Project (UTIP) and the World Top Incomes Database (WTID), is incorporated. \nocite{UTIP2013}\nocite{Alvaredo2014} Both of these sources seek to address the tradeoff between coverage and comparability by employing a single consistent, though more limited, conceptualization of income inequality: the UTIP measures differences in average pay between industrial classifications as reported by the United Nations Industrial Development Organization \citep{Galbraith2009}, while the WTID consists of the share of taxable income reported on various top fractiles of personal tax returns \citep{Atkinson2011}.  This approach has the advantage of allowing many country-years to be observed. The WTID includes nearly one thousand country-years since 1960; the UTIP over four thousand.  Each of these sources, however, has distinct drawbacks for making comparisons across countries in the level of income inequality.  For the UTIP data, comparability is compromised by differences across countries in the share of all employment that is in agriculture or services rather than industry, the share of earnings differences that occurs within rather than between industrial classifications, and the share of income accruing to capital rather than labor.  
For the WTID, the problems include substantial differences across countries in the definitions of taxable income and the tax unit as well as in the prevalence of tax avoidance and evasion across incomes; these differences led the dataset's compilers to use it only to compare the trends over time across countries, not levels \citep[see][4-5 and passim]{Atkinson2011}.\footnote{\doublespacing I am grateful to Facundo Alvaredo for conversations underscoring how poorly cross-national differences in the WTID data correspond to cross-national differences in inequality measures from other sources.}  All of these differences, however, should change either slowly (e.g., the size of the industrial workforce) or relatively rarely (e.g., significant reforms to the tax code), making these sources appropriate for comparisons within a given country over time.  On these grounds, the relationship within each country between the LIS-compatible Gini indices already estimated and any available UTIP or WTID data over time is estimated using loess regression.\footnote{\doublespacing Due to differences in the scale and dispersion of these variables, all are logarithmically transformed in these analyses.  To capture the uncertainty in the LIS-compatible Ginis, the analyses are repeated using ten Monte Carlo simulations and the results combined.}  These analyses are then used to predict a LIS-compatible estimate of the Gini index with standard error for each of those observations with UTIP or WTID data but without information in the source data; for observations with data in both the UTIP and the WTID, the estimate with the smallest standard error is used.  Monte Carlo simulation is again used to generate one thousand simulated series drawn from a distribution with mean equal to the point estimate for each country-year and with a standard deviation equal to the standard error for that estimate.
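
As a sketch of this last step, and assuming that the draws are normal (the distribution family is an assumption made here for illustration, and the notation is likewise introduced only for this sketch), the simulated value for country $i$ in year $t$ in simulation $s$ can be written as
\[
g_{it}^{(s)} \sim \mathcal{N}\left(\hat{g}_{it},\ \hat{\sigma}_{it}^{2}\right), \qquad s = 1, \ldots, 1000,
\]
where $\hat{g}_{it}$ is the point estimate of the LIS-compatible Gini index and $\hat{\sigma}_{it}$ is its standard error.  The spread of the $g_{it}^{(s)}$ across simulations then carries the uncertainty of each estimate forward into later steps.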

In the final step in generating the SWIID's net-income inequality series, values for all post-1975 country-years still without estimates but between country-years with estimates are then interpolated for each simulation.  The entire process is then repeated to generate a series standardized on the LIS household-adult-equivalent market-income data.  Measures of absolute redistribution (the difference between the market-income and net-income Gini indices) and of relative redistribution (this difference divided by the market-income Gini and multiplied by 100, that is, the percentage by which market-income inequality is reduced) are calculated. Observations for these measures of redistribution are omitted for countries for which the source data do not include more than three observations of either market- or net-income inequality.\footnote{\doublespacing In such cases, although the two inequality series each still constitute the most comparable available estimates, the difference between them reflects only information from other countries, and treating it as meaningful independent information about redistribution cannot be justified.}  These four series of estimates---of net-income inequality, market-income inequality, absolute redistribution, and relative redistribution---together constitute the SWIID.
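
In symbols introduced here only for illustration, with $G^{\mathrm{mkt}}_{it}$ and $G^{\mathrm{net}}_{it}$ denoting the market-income and net-income Gini indices for country $i$ in year $t$, the two measures of redistribution defined above are
\[
R^{\mathrm{abs}}_{it} = G^{\mathrm{mkt}}_{it} - G^{\mathrm{net}}_{it}, \qquad
R^{\mathrm{rel}}_{it} = 100 \times \frac{G^{\mathrm{mkt}}_{it} - G^{\mathrm{net}}_{it}}{G^{\mathrm{mkt}}_{it}}.
\]
For a hypothetical country-year with a market-income Gini of 50 and a net-income Gini of 30, absolute redistribution would be 20 Gini-index points and relative redistribution would be 40\%.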


\section*{Assessing the SWIID}
Version 5.0 of the SWIID dataset covers 174 countries, with estimates of net-income inequality comparable with the LIS Key Figures for 4631 country-years and estimates of market income inequality comparable with those obtained from the LIS for 4629 country-years.  The SWIID's aim, as noted above, is to provide data for the broadest possible sample of countries and years that are made as comparable as feasible.  On the first criterion, breadth of coverage, the SWIID bests all other inequality datasets.  In fact, it more than doubles the country-year observations in Milanovic's (\citeyear{Milanovic2013}) \emph{All the Ginis}, the next-largest income inequality dataset, and it is more than ten times the size of the Eurostat data, the largest collection calculated on the basis of a single welfare definition and equivalence scale.

This breadth of coverage allows researchers to make comparisons of countries around the world.  As shown in Figure~\ref{F:brics}, the LIS-comparable data on China and India provided by the SWIID tell a very different story than the left and center panels of Figure~\ref{F:adj}.  They reveal that estimated inequality in net incomes had been discernibly higher in India than in China until the late 1990s, and with the exception of a few years early in the new millennium when inequality in China was briefly distinctively higher, the difference between the two countries' levels of inequality has not been large or clear since.  This figure further compares India and China with the other two BRICs, Brazil and Russia.  Notably, it shows that the recent downward trend in inequality in Brazil---long thought one of the most unequal countries in the world---has left that country with a more equal distribution of net income than either.

<<brics, echo=FALSE, results='hide', cache=TRUE, warning=FALSE>>==
swiid <- read.csv("SWIIDv5_0summary.csv", as.is=T)

brics <- swiid[swiid$country=="Brazil" | (swiid$country=="Russian Federation") | 
    (swiid$country=="India" & swiid$year>=1974) | swiid$country=="China",]

pdf(file="brics.pdf",width=8, height=5.25)
ggplot(data=brics, aes(x=year, y=gini_net, colour=country)) + geom_line() +
    theme_bw() +
	theme(legend.position="none") +    
	coord_cartesian(xlim=c(1975,2012),ylim = c(20, 65)) +
	labs(x = "Year", y = "SWIID Gini Index, Net Income") +
    geom_ribbon(aes(ymin = gini_net-1.96*gini_net_se, ymax = gini_net+1.96*gini_net_se, 
    	fill=country, linetype=NA), alpha = .25) +
    geom_text(aes(1977,60,label = "Brazil", colour="Brazil"), size=4.5) +
    geom_text(aes(1980,40,label = "India", colour="India"), size=4.5) +
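    geom_text(aes(2003,38,label = "Russia", colour="Russian Federation"), size=4.5) +  # colour must match the country level in the data so the label shares its line's colour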
	geom_text(aes(1987,35,label = "China", colour="China"), size=4.5) 
graphics.off()
@

[Figure~\ref{F:brics} about here.]

Of course, much hinges on the claim that the SWIID data are, in fact, comparable to the LIS.  A preliminary test concerns the extent to which the SWIID avoids relying on the dubious assumption underlying fixed adjustments that differences in inequality statistics calculated on two dissimilar bases are constant across space and time.  Figure~\ref{F:types} displays for each region the share of each type of adjustment used to calculate the SWIID estimates of net-income inequality from the source data.  First, it is important to recognize that the SWIID completely avoids global fixed adjustments of the sort recommended by, e.g., \citet[582]{Deininger1996}.  At worst, some SWIID estimates for poorer countries are based on relationships observed elsewhere in the developing world, but never on those seen in advanced countries (and for advanced countries, no estimates rely on information from countries outside their region).  This alone constitutes a considerable advance over common practice.
    
<<sbr, echo=FALSE, results='hide', cache=TRUE, warning=FALSE>>==
sbr <- read.csv("7byregion.csv")
sbr$region <- with(sbr, factor(region, levels=region[order(order)]))
 
sbr2 <- melt(sbr[1:8], id="region", variable.name="type")

pdf(file="sbr.pdf",width=8.5, height=6)
ggplot(sbr2, aes(region, value, fill = type)) + 
    geom_bar(stat = "identity") + theme_bw() + theme(legend.position="right") +
    labs(x = "", y = "") + 
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
    scale_fill_manual(values=c("#35505E", "#416F86", "#4E8EAF", "#66CCFF", "#DF6679", "#ca0020", "#000000"),
                      name="Adjustment Type",
                      breaks=c("lis", "low", "cd", "c", "reg", "ad", "g"),
                      labels=c("Unadjusted LIS Data", "Country-Year", "Country-Decade", "Country", "Region", "Advanced/Developing", "Global"), 
                      guide = guide_legend(reverse=TRUE))
graphics.off()
@

[Figure~\ref{F:types} about here.]

The figure further shows that, where the available inequality data allow, the SWIID estimates are based on much more nuanced---and concomitantly much more realistic---adjustments.  In the advanced English-speaking countries, for example, nearly seven in every ten estimates are taken directly from the LIS or are based on country-year-varying predicted ratios from loess regressions, as are about half of the estimates for the countries of western Europe.  Only about 5\% and 10\%, respectively, of the estimates in these regions are based on ratios observed in other countries.  A similar share of the estimates for Japan and the four Asian Tigers (Hong Kong, South Korea, Taiwan, and Singapore) are likewise based on at least within-country ratios, although the relatively small number of country-years in the LIS limits the extent to which these relationships were modeled as varying over time. Among the world's richer countries, only in ex-Communist central and eastern Europe are a substantial fraction of estimates based on regional averages rather than on information from within the country itself.

In the developing world, on the other hand, most SWIID estimates are based on ratios observed in other countries.  This is a disappointing limitation imposed by the relative paucity of data available for developing countries.  Even here, though, a substantial minority of estimates use within-country ratios, and most others rely only on the averages of regional neighbors.  And as the LIS expands to cover more of the developing world---it has announced that it will soon release data for four more Latin American countries as well as Egypt and Serbia---the SWIID will continue to improve in this respect.  Overall, as depicted in the rightmost column of the figure, more than half of SWIID estimates are based on same-country information on the relationships between inequality statistics calculated on different combinations of welfare definition and equivalence scale.



<<predlis, echo=FALSE, results='hide', cache=TRUE, warning=FALSE>>==
pl <- read.csv("predlis.csv")
pl$cy <- with(pl, factor(cy, levels=cy[order(pred, year, country)]))
pl <- pl[!is.na(pl$gini_swiid), ]

pl2_long <- melt(pl[, c("cy", "gini_lis", "gini_swiid")], id="cy", value.name="gini")

pl2_long_se <- melt(pl[, c("cy", "gini_lis_se", "gini_swiid_se")], id="cy", value.name="se")
pl2_long_se$variable <- gsub(x=pl2_long_se$variable, pattern="\\_se", "") 

pl2 <- merge(pl2_long, pl2_long_se)
    
pl3 <- melt(pl[, c("cy", "diff", "ok")], id=c("cy", "ok"), value.name="diff")

pl3_se <- melt(pl[, c("cy", "diff_se")], id="cy", value.name="se")
pl3_se$variable <- gsub(x=pl3_se$variable, pattern="\\_se", "") 

pl3 <- merge(pl3, pl3_se)
pl3$cy <- with(pl3, factor(cy, levels=cy[order(diff)]))
pl3 <- pl3[order(pl3$cy), ]
row.names(pl3) <- 1:71
pl3$clr <- as.character(ifelse(pl3$ok==1, "black", "slateblue"))

pdf(file="predlis.pdf",width=8.5, height=5.25)
ggplot(pl3) + 
    geom_rect(aes(xmin=-Inf, xmax=Inf, ymin=-2, ymax=2), fill='gray90', alpha=.2) +
    geom_hline(yintercept=0, linetype=2, colour="gray60") +
    geom_pointrange( 
                    aes(x=cy, y=diff, ymin=diff - 2*se, ymax = diff + 2*se, colour = clr)) +
    theme_bw() + theme(legend.position="none") +
    scale_colour_manual(values=c("black", "slateblue")) +
    labs(x = "", y = "SWIID Prediction minus LIS") + 
    theme(axis.text.x = element_text(angle = 60, hjust = 1, size=7, colour = pl3$clr)) +
    scale_y_continuous(breaks=c(-15, -10, -5, 0, 5)) + scale_x_discrete(limits=levels(pl3$cy))   
graphics.off()
@

[Figure~\ref{F:pred} about here.]

The expansion of the LIS since the release of the SWIID provides an even more exacting test.  By adopting the LIS as its standard, the SWIID means to provide estimates of what the LIS would show if a given country-year were in fact included in the LIS.  Since 2008, the LIS has added data on 71 country-years that had been already included in the SWIID.  If the SWIID succeeds in providing estimates of the highly comparable LIS figures, the differences between what the SWIID version then available predicted and the data that the LIS released will not be substantively and statistically significant.\footnote{\doublespacing For discussions that inspired this test, I thank participants in the Expert Group Meeting on Reducing Inequalities in the Context of Sustainable Development, Department of Economic and Social Affairs, United Nations, New York, October 24-25, 2013.}

Figure~\ref{F:pred} plots these differences, arranged with the SWIID underestimates of the LIS on the left and overestimates on the right.  A first reassuring observation is that there is no overall tendency in either direction: the median difference is \Sexpr{round(pl3[36, "diff"], 2)} Gini-index points.  Differences larger than two points are considered substantively significant; this level is admittedly but necessarily arbitrary.  In only 5 of these 71 country-years---that is, just 7\%---are the differences between the LIS and what the then-current version of the SWIID predicted both substantively and statistically significant.  This is an impressive record of out-of-sample prediction, and it lends considerable confidence that the SWIID has been providing data for a broad sample of countries and years that are comparable to the LIS, and so in turn across space and time.
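
Stated compactly, in notation adopted here only for illustration, let $d_{it}$ be the SWIID prediction minus the LIS value for a country-year and let $\mathrm{se}(d_{it})$ be the standard error of that difference.  Taking the 95\% confidence interval to be approximately $d_{it} \pm 2\,\mathrm{se}(d_{it})$, which is an approximation assumed for this sketch, a prediction counts as a miss only when both
\[
|d_{it}| > 2 \qquad \mbox{and} \qquad |d_{it}| > 2\,\mathrm{se}(d_{it})
\]
hold, that is, when the difference is larger than two Gini-index points and its confidence interval excludes zero.  Five misses among 71 country-years gives $5/71 \approx 0.07$, the 7\% reported above.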

<<wiidpred, echo=FALSE, results='hide', cache=FALSE, warning=FALSE>>==
lis <- read.dta("full_kf.dta")
lis <- lis[, 1:3]
names(lis)[1] <- "gini_lis"

tmp <- tempfile(fileext = paste0(".", "xls"))
download.file("http://www.wider.unu.edu/research/WIID3-0B/en_GB/wiid/_files/92393927664936620/default/WIID3b.xls", destfile = tmp, mode = "wb")
wiid <- read_excel(tmp)
names(wiid) <- tolower(names(wiid))

wiid2 <- wiid[grep("Net|Disposable", wiid$welfaredefn), ]
wiid2 <- wiid2[grep("eq", wiid2$equivsc), ]
wiid2 <- wiid2[!grepl("Urban", wiid2$areacovr), ]
wiid3 <- ddply(wiid2, .(country, year), summarize, 
               gini = mean(gini))

lis.wiid <- merge(lis, wiid3, all.x=T)

wiid.comp <- table(with(lis.wiid, abs(gini - gini_lis) < 2))
@

The SWIID's record in predicting the LIS is all the more impressive in comparison to that of other  cross-national income inequality datasets.  Recall the two options implied by the tradeoff between comparability and coverage discussed above.  The first option privileges comparability at the cost of coverage by using only those data generated using the same basis of calculation.  The recently released Version 3b of the \citet{UNU2014} dataset includes \Sexpr{length(wiid3$gini)} country-years with Gini indices calculated on the basis of household-adult-equivalent disposable income, the same combination of equivalence scale and welfare definition as employed to calculate the LIS Key Figures.\footnote{\doublespacing Combining these data---none of which are sourced from the LIS---with the LIS Key Figures provides the broadest country-year coverage of usable observations available for any combination of equivalence scale and income definition in the \citet{UNU2014} data.  For the purposes of this test, when more than one observation of household-adult-equivalent disposable income inequality for a given country-year was available in this dataset, these observations were averaged.}  A substantial number of these country-years, \Sexpr{sum(wiid.comp)}, overlap with the LIS, allowing a test of the comparability of these data similar to the one applied to the SWIID in Figure~\ref{F:pred}.  Given that this approach sacrifices coverage sharply---having only about one-fifth of the observations in the SWIID---for the purpose of maximizing comparability, the results are disappointing: the difference between the UNU-WIDER data and the LIS is substantively and statistically significant in \Sexpr{wiid.comp[1]} country-years, \Sexpr{round(wiid.comp[1]*100/sum(wiid.comp))}\% of the total.  

<<atgpred, echo=FALSE, results='hide', cache=TRUE, warning=FALSE>>==
atg.path <- "/All the Ginis"
atg.names <- c("allginis.dta","BMilanovic_allginis.dta","BMilanovic_allginis.dta","allginis_Oct2012.dta","allginis_2013.dta")
atg.files <- paste0(atg.path, "/", c(2005, 2009, 2010, 2012, 2013), "/", atg.names)

atg.vars <- c("country", "year", "gini_LIS", "Giniall", "Di", "Dhh", "Dg")

trim <- function (x) gsub("^\\s+|\\s+$", "", x)

for (i in 3:5) {
    atg <- read.dta(path.expand(atg.files[i]))
    atg <- atg[!is.na(atg$Giniall), atg.vars]
    atg$country <- trim(atg$country)
    atg$country <- gsub(x=atg$country, pattern="Taiwan.*", "Taiwan")
    atg$Giniall <- round(atg$Giniall, 1)
    atg$lis_data <- atg$Giniall == round(atg$gini_LIS, 1)
    atg$Dc <- 1-atg$Di
    atg$Dg[is.na(atg$Dg)] <- 0
    m <- lm(Giniall ~ Dhh + Dg + Dc, data=atg, na.action=na.omit)
    m$coefficients[is.na(m$coefficients)] <- 0
    atg$adjGiniall <- with(atg, round(Giniall - (Dhh*m$coefficients["Dhh"] + Dg*m$coefficients["Dg"] + Dc*m$coefficients["Dc"]), 1))   
    names(atg)[3:10] <- paste0(names(atg)[3:10], i)
    if (i==3) lis.atg <- merge(lis, atg, all.x=T) else lis.atg <- merge(lis.atg, atg, all.x=T)
    lis.atg <- unique(lis.atg)
}

atg.s <- read.dta(path.expand(atg.files[5]))
atg.s <- atg.s[, c("country", "year", "source_of_data")]
atg.s$country <- trim(atg.s$country)
lis.atg <- merge(lis.atg, atg.s, all.x=T)
lis.atg <- unique(lis.atg)

lis.atg$version <- lis.atg$Giniall <- lis.atg$adjGiniall <- lis.atg$mb_lis <- NA
for (i in 3:5) {
    lis.atg[lis.atg[paste0("lis_data", i)] == T & !is.na(lis.atg[paste0("lis_data", i)]), "mb_lis"] <- lis.atg[lis.atg[paste0("lis_data", i)] == T & !is.na(lis.atg[paste0("lis_data", i)]), paste0("Giniall", i)]
}

for (i in 5:3) {
    lis.atg[(lis.atg[paste0("lis_data", i)] == F | is.na(lis.atg[paste0("lis_data", i)])) & is.na(lis.atg$Giniall) & !is.na(lis.atg[paste0("Giniall", i)]), "version"] <- i  
    lis.atg[lis.atg$version==i & !is.na(lis.atg$version), "Giniall"] <- lis.atg[lis.atg$version==i & !is.na(lis.atg$version), paste0("Giniall", i)]
    lis.atg[lis.atg$version==i & !is.na(lis.atg$version), "adjGiniall"] <- lis.atg[lis.atg$version==i & !is.na(lis.atg$version), paste0("adjGiniall", i)]   
}

lis.hpc <- read.csv("lis_hpc.csv")
lis.hpc$cc <- gsub(x=lis.hpc$cy, pattern="[0-9]{2}", "")
lis.hpc$cc <- gsub(x=lis.hpc$cc, pattern="uk", "gb")
lis.hpc$country <- countrycode(toupper(lis.hpc$cc), origin="iso2c", destination="country.name")
capwords <- function(s, strict = FALSE) {
    cap <- function(s) paste(toupper(substring(s, 1, 1)),
{s <- substring(s, 2); if(strict) tolower(s) else s}, sep = "", collapse = " " )
sapply(strsplit(s, split = " "), cap, USE.NAMES = !is.null(names(s)))
}
lis.hpc$country <- capwords(lis.hpc$country, strict=T)
lis.hpc$country <- gsub(x=lis.hpc$country, pattern="Of", "of")
lis.hpc$country <- gsub(x=lis.hpc$country, pattern="Taiwan.*", "Taiwan")
lis.hpc$country <- gsub(x=lis.hpc$country, pattern="Slovak.*", "Slovak Republic")

lis.hpc$year <- gsub(x=lis.hpc$cy, pattern="[a-z]{2}", "")
lis.hpc$year <- gsub(x=lis.hpc$year, pattern="^([01])", "20\\1")
lis.hpc$year <- gsub(x=lis.hpc$year, pattern="^([^2])", "19\\1")
lis.hpc$year <- as.numeric(lis.hpc$year)

lis.atg <- merge(lis.atg, lis.hpc, all.x=T)
write.csv(lis.atg, "lisatg.csv")

lis.atg$diff <- with(lis.atg, Giniall - lis)
atg.unadj <- table(with(lis.atg[lis.atg$version>2, ], abs(diff) < 2))
atg.unadj.fr <- as.numeric(round(atg.unadj[1]*100/sum(atg.unadj)))

lis.atg$diff.adj <- with(lis.atg, adjGiniall - lis)
atg.adj <- table(with(lis.atg[lis.atg$version>2, ], abs(diff.adj) < 2))
atg.adj.fr <- as.numeric(round(atg.adj[1]*100/sum(atg.adj)))
@

Not surprisingly, the second option---to increase coverage at the cost of comparability by using data generated using multiple combinations of equivalence scale and welfare definition---fares even worse than the first.  Beginning in 2010, the \emph{All the Ginis} dataset adopted the LIS as its most preferred source of income inequality data, using other sources to fill in observations of the \texttt{Giniall} series for which LIS data were not available \citep[2-3]{Milanovic2010}.  The result is considerably better coverage than that achieved via the first option---a total of \Sexpr{qq[which(qq$Dataset1=="All the Ginis"), "Observations"]} country-years in the most recent version \citep{Milanovic2013}---but at the cost of mixing observations calculated on eight different bases.  To assess the comparability of these data, for the \Sexpr{as.numeric(sum(atg.unadj))} country-years with non-LIS \texttt{Giniall} data, I took the difference between \texttt{Giniall} (using the latest of the 2010, 2012, and 2013 revisions of the dataset with non-LIS data for that country-year) and the Gini index of inequality in household per capita disposable income (\emph{All the Ginis}' preferred combination of welfare definition and equivalence scale) calculated from LIS data.  No fewer than \Sexpr{atg.unadj[1]} of the non-LIS \texttt{Giniall} country-years, or \Sexpr{atg.unadj.fr}\%, were substantively and statistically significantly different from the LIS data now available.  
Using the \emph{All the Ginis} dummies to make global fixed adjustments within each revision as suggested by \citet[8]{Milanovic2013} actually yields even fewer good predictions: fully \Sexpr{atg.adj.fr}\% of the adjusted non-LIS \texttt{Giniall} observations are substantively and statistically significantly different from the LIS.\footnote{\doublespacing Excluding those non-LIS Ginis based on the INDIE series in the 2013 revision, which \citet{Milanovic2013} prefers to the LIS, yields very similar rates: 19 of the 41 differences (46\%) between the unadjusted data and the LIS---and 25 of the 41 (61\%) differences between the adjusted data and the LIS---are substantively and statistically significant.}  The SWIID's approach of using the LIS as a standard and estimating its missing values with all of the available data, using as much information as possible from the same country and proximate years, does a much better job of providing comparable estimates than either of the two more straightforward options presented by the tradeoff of comparability and coverage.

Still, further examination of Figure~\ref{F:pred} raises two issues that give some pause.  First, though countries new to the LIS are generally predicted as well as others, the previous SWIID estimates for India in 2004 and China in 2002 (both from Version 3.1) badly understated the level of inequality in those two countries.  These country-years constituted not only the first LIS observation in each country but also the first LIS observations of the developing countries in Asia.  The relationships between the LIS data and other Gini indices calculated on different bases observed elsewhere in the developing world proved not to provide good estimates for this region.\footnote{\doublespacing The difference between the first LIS observation in Africa, of South Africa in 2008, and the closest previous SWIID prediction for that country, for 2005, was just $4\pm7$ points, suggesting that these relationships are particularly distinctive in developing Asia.}  Now that all regions of the world have at least some representation in the LIS, such large errors are unlikely to persist in the SWIID, but this points again to the continuing need to minimize reliance on information from other countries and particularly other regions.

Second, only \Sexpr{length(pl3$ok[pl3$ok==1])} of these 71 differences, or \Sexpr{round(length(pl3$ok[pl3$ok==1])*100/71)}\%, have 95\% confidence intervals that include zero.  This suggests that the standard errors associated with the SWIID estimates have often been too small.  Some corrections have been made in the code used to generate Version 5 of the SWIID to help ensure that the certainty of the estimates is not overstated.  It is possible, however, that the lack of any consideration of the sampling errors of the Gini indices in the source data continues to be an important cause of overconfidence.  Measures of sampling error---or even the information needed to calculate them---are only rarely provided in the sources of these data, but a reasonable estimate would likely be preferable to the present assumption that the sampling error is zero.  The possibility of incorporating the sampling error in the source data will be considered in future versions of the SWIID.

The sum of these assessments, however, is very positive.  The SWIID covers a broader sample of countries and years than any other income inequality dataset, allowing researchers to investigate differences in levels and trends in inequality that would otherwise go unexamined.  It entirely avoids the dubious assumption that differences in inequality statistics calculated on two dissimilar bases are constant across space and time inherent in global fixed adjustments, and it minimizes reliance on information from other countries and regions.  Finally, the SWIID has done a very good job of predicting LIS data before its release, lending confidence in its cross-national and over-time comparability, and its performance on this score far surpasses that of other cross-national income inequality datasets when  tested similarly.  In light of the foregoing, the SWIID is clearly the best source available for broadly cross-national work on income inequality.


\section*{Using the SWIID for Cross-National Research}
The SWIID can be accessed in two ways.  First, to facilitate straightforward comparisons of levels and trends in income inequality, the SWIID is now available as a user-friendly web application built using RStudio and Shiny.  The web application allows users to graph the SWIID estimates of any of net-income inequality, market-income inequality, relative redistribution, or absolute redistribution in as many as four countries or to compare these measures within a single country.  Its output can be downloaded with a click for use in reports or articles.

Second, to perform statistical analyses, the SWIID data are available pre-formatted for use with the tools developed for analyzing multiply imputed data in Stata (the \texttt{mi estimate:} prefix) and in R (the \texttt{mitools} package).  These tools automate the process of performing an analysis repeatedly using multiple Monte Carlo simulations and averaging the results \citep[see][]{King2001}.  Although these tools introduce a bit more complexity into an analysis, they are necessary for taking the uncertainty in the SWIID estimates into account.  A review of how to use these tools, along with examples, is included in the download.
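
The averaging that these tools perform follows the standard rules for combining results across multiply imputed datasets.  As a sketch in notation introduced here, suppose an analysis is run on each of $m$ simulations, yielding estimates $\hat{\beta}_j$ with estimated variances $\hat{V}_j$ for $j = 1, \ldots, m$.  The combined point estimate and its variance are then
\[
\bar{\beta} = \frac{1}{m} \sum_{j=1}^{m} \hat{\beta}_j, \qquad
T = \frac{1}{m} \sum_{j=1}^{m} \hat{V}_j + \left(1 + \frac{1}{m}\right) \frac{1}{m-1} \sum_{j=1}^{m} \left(\hat{\beta}_j - \bar{\beta}\right)^2 ,
\]
and the combined standard error is $\sqrt{T}$.  The first term of $T$ is the average within-simulation variance; the second reflects the variation in estimates across simulations, which is how the uncertainty in the SWIID estimates enters the final results.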

Whether using the web application or the pre-formatted data, most researchers will find the net-income inequality series to be the best suited to their needs conceptually.  Market-income inequality, although accurately described as measuring the distribution of income before taxes and transfers are taken into account, cannot be considered `pre-government': a wide range of non-redistributive government policies, from public education and job-training programs to capital-accounts regulations, also shape the income distribution \citep[see, e.g.,][]{Iversen2008, Morgan2013}.  In addition to such market-conditioning policies, market-income inequality also includes the feedback effects of redistributive policies on households' decisions regarding savings, employment, and retirement.  Where robust public pension programs are in place, for example, most households will save little for retirement; as a result, most elderly households will be without market income and market-income inequality will be exaggerated in comparison to settings in which public pensions are less complete \citep[see, e.g.,][]{Bradley2003, Jesuit2010}.  By affecting market-income inequality, of course, market conditioning and policy feedback also mean that measures of absolute and relative redistribution such as those included in the SWIID do not map straightforwardly onto the broader concept of `the effect of government on inequality.'  \citet{Morgan2013} present an example of how market-income inequality, absolute redistribution, and net-income inequality can be used in tandem to investigate how governments affect inequality through both market conditioning and redistribution.

Both the SWIID web application and the pre-formatted data are available at the author's website.  All of the files needed to replicate the SWIID, including all of the source data, are available there as well.  

It bears underscoring that the SWIID represents a particular choice in the balance between comparability and coverage: it maximizes comparability for the broadest possible coverage of countries and years.  This makes the SWIID ideal for broadly cross-national work, but it is not the most appropriate choice for all research on income inequality.  Greater comparability can often be achieved when one’s scope of inquiry is narrower.  Though it offers only limited coverage, the Luxembourg Income Study provides higher quality, superior comparability, and greater flexibility than the SWIID; these traits will continue to make it the preferred source for many cross-national studies.  Those studying changes in inequality over time in a single country, further, will often find that examining the national sources found within the SWIID source data and becoming familiar with the exact assumptions and definitions they employ will better meet their needs.  Approaches using all these data sources hold promise for advancing our understanding of economic inequality and its causes and consequences.

\pagebreak

\bibliographystyle{ajps}
\bibliography{FSLibrary}

\end{spacing}

\pagebreak

\begin{figure}[htbp] 
  \caption{The Tradeoff Between Comparability and Coverage}
  \label{F:qq}
  \begin{center}
    \includegraphics[width=5.25in]{qq.pdf}
  \begin{footnotesize}
  \begin{tabular}{p{.1in} p{5.1in}}
  & \emph{Note}: As underscored with contrasting color, only the LIS data are harmonized.  \textsf{D\&S Accept} refers to data presented in \citet{Deininger1996}; \textsf{All the Ginis} to data presented in \citet{Milanovic2013}; \textsf{SEDLAC} to \citet{CEDLAS2013}; \textsf{WIID2c} to \citet{UNU2008}; and \textsf{WIID3b} to \citet{UNU2014}.  Other sources were accessed October 1, 2014. 
  \end{tabular}
  \end{footnotesize}
  \end{center}
\end{figure}

\newpage

\begin{figure}[htbp] 
  \caption{Inequality in China and India}
  \label{F:adj}
  \begin{center}
    \includegraphics[width=5.25in]{adj.pdf}
  \begin{footnotesize}
  \begin{tabular}{p{.1in} p{5.1in}}
  & \emph{Notes}: The left panel depicts the unadjusted \texttt{Giniall} series (Milanovic 2013).  The center panel applies global fixed adjustments to these data, as recommended by Milanovic (2013, 8), to account for differences in their bases of calculation.  The right panel compares the adjusted series with data from the Luxembourg Income Study.  Even with adjustments, the \texttt{Giniall} series yield very different conclusions than the comparable data provided by the LIS.
  \end{tabular}
  \end{footnotesize}
  \end{center}
\end{figure}

\newpage

\begin{figure}[htbp] 
  \caption{Net-Income Inequality in the BRICs Countries, SWIID v5.0}
  \label{F:brics}
  \begin{center}
    \includegraphics[width=5.25in]{brics.pdf}
  \begin{footnotesize}
  \begin{tabular}{p{.1in} p{5.1in}}
  & \emph{Notes}: Solid lines indicate the SWIID's mean estimate of net-income inequality across household adult-equivalents; shaded regions indicate the 95\% confidence interval of these estimates.
  \end{tabular}
  \end{footnotesize}
  \end{center}
\end{figure}

\newpage

\begin{figure}[htbp] 
  \caption{SWIID Adjustment Types by Region}
  \label{F:types}
  \begin{center}
    \includegraphics[width=5.25in]{sbr.pdf}
  \end{center}
%   \begin{footnotesize}
%   \begin{tabular}{p{.1in} p{4.75in}}
%   & \emph{Notes}: 
%   \end{tabular}
%   \end{footnotesize}
\end{figure}

\newpage

\begin{landscape}
\begin{figure}[htbp] 
  \caption{SWIID Predictions of the LIS}
  \label{F:pred}
  \begin{center}
    \includegraphics[width=8in]{predlis.pdf}
  \end{center}
  \begin{footnotesize}
  \begin{tabular}{p{.1in} p{8in}}
  & \emph{Notes}: The points represent the differences between values of the LIS net-income Gini index and estimates for these same observations provided by the SWIID version current at the time the LIS data were made available.  The whiskers trace the associated 95\% confidence intervals of these differences.  The area within $\pm$2 Gini points of zero is shaded gray; differences within this interval are not considered substantively significant.  Differences that do not reach statistical significance are depicted in black; statistically significant differences are blue.  For only 5 of 71 (7\%) country-years are the differences between the predictions in the SWIID and the actual LIS values both substantively and statistically significant. 
  \end{tabular}
  \end{footnotesize}
\end{figure}
\end{landscape}



\end{document}